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AN EFFICIENT LINEAR PROGRAMMING METHOD FOR 
OPTIMAL TRANSPORTATION 


ADAM M. OBERMAN AND YUANLONG RUAN 

Abstract. An efficient method for computing solutions to the Optimal Trans¬ 
portation (OT) problem with a wide class of cost functions is presented. The 
standard linear programming (LP) discretization of the continuous problem 
becomes intractible for moderate grid sizes. A grid refinement method re¬ 
sults in a linear cost algorithm. Weak convergence of solutions is stablished. 
Barycentric projection of transference plans is used to improve the accuracy of 
solutions. The method is applied to more general problems, including partial 
optimal transportation, and barycenter problems. Computational examples 
validate the accuracy and efficiency of the method. Optimal maps between 
nonconvex domains, partial OT free boundaries, and high accuracy barycen- 
ters are presented. 
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1. Introduction 

The theory of Optimal Transportation (OT) is a powerful tool which defines 
a distance on probability measures, called the Wasserstein distance. It has pro¬ 
foundly impacted such fields as fluid mechanics, nonlinear diffusions, differential 
geometry, and it has found applications in areas such as reflector design, econom¬ 
ics, astrophysics, etc. |Eva991 IVilOSl IVilOS) . The related barycenter problem has 
been used for comparison of histograms of features for big data problems, and shape 
interpolation for computer graphics [AClli ISdGS~*~15lICPR15) . 

The Wasserstein distance cannot be evaluated analytically in most cases, and so 
numerical methods are needed. Existing numerical methods are either inaccurate, 
very costly (polynomial time), or too restrictive on the costs and measures to be 
widely applicable. These methods do not meet need the needs of modern applica¬ 
tions, which involve large problem sizes or require the solution of large numbers of 
OT problems. 

We introduce an efficient Linear Programming (LP) method for solving Optimal 
Transportation problems. Our method is not restricted to quadratic cost functions: 
it allows for a wide class of cost functions. By efficient, we mean linear cost in terms 
of solution time and memory requirements (see Table below). In practise we can 
compute solutions with problem sizes going up to half a million variables on a 
laptop computer using academic or commerical optimization software. Available 
parallel LP solvers allow the method to scale to even larger problem sizes. 

Two types of error are present in approximation problems which discretize a con- 
tinous (infinite dimensional) optimization problem. This first is the discretization 
error: the different between the exact solution of the discrete problem (at a given 
resolution) and the solulution of the limiting problem. Our discretization of the infi¬ 
nite dimensional LP is natural, allowing a weak convergence proof to be established 
using available stability results. The second type of error relates to how precisely 
the finite dimensional optimization problem is solved. In contrast to many existing 
methods, which solve surrogate or approximate optimzation problems, the method 
presented here compute the optimal solution of the discrete linear transportation 
problem, to within standard solver tolerances. 

A limitation of the Linear Programming approach to Optimal Transportation, 
compared to the Partial Differential Equations methods discussed below, is that 
maps are approximated by plans, which are multivalued (see Figure [^. We over¬ 
come this limitation using a theoretical tool, barycentric projection, which recovers 
the approximate map and dramatically improves the accuracy of solutions. 

The efficiency and accuracy of the method reveals solution features not otherwise 
available, including optimal maps between nonconvex sets, and for non-quadratic 
cost functions. The method is easily generalized to related problems, allowing for 
the computation of accurate free boundaries in partial optimal transportation, and 
high resolution barycenters for shapes. 

1.1. Background on the Optimal Transportation problem. The Monge for¬ 
mulation of the Optimal Transportation problem which seeks optimal maps while 
the Kantorovich formulation seeks optimal transference plans. Transference plans 
are a weaker notion of solution which can be computed using linear programming. 

Definition 1 (Givens for the OT problem). Given two probability measures, 

pL, V with bounded supports X,Y C respectively 
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and the cost function 

c{x, ?/) : X X y — >■ K. 

If the measures /r and v are absolutely continuous with respect to Lebesgue measure, 
write 

= fix)dx, diy{y) = g{y)dy 
for positive, Lebesgue measurable functions / and g. 

The goal is to rearrange one measure into the other, while minimizing the cost 
of mapping x to y, weighted by the amount of mass transported. There are two 
notions of rearrangement. 

1.2. The Monge formulation. In the Monge formulation, we consider the class 
of measurable, one-to-one mappings T : —>■ which rearrange g into v. If the 
measures y and v have smooth densities, and if T is continuously differentiable, the 
change of variable formula from multivariable calculus, leads to the rearrangement 
condition, 

(1) f{x)=g{T{x))\detWT{x)\ 

which is a fully nonlinear constraint on the mapping T. For general measures, the 
rearrangement condition is given by 

(2) = y[T~^{B)], for any measureable set B cY. 

We write v = T#y, and say that T transports y onto u. 

Monge’s problem is to minimize the total work corresponding to a map T over 
the set of all measurable maps T which transport y onto v. 


(3) 


Minimize/[T] = / c(x,T{x))dy{ 

Jx 


Subject to: T^y = v 


1.3. The Kantorovich Formulation. The Kantorovich formulation recasts the 
Optimal Transportation problem as an infinite dimensional Linear Program. Refer 
again to |Vil031 IEva99) . 

A transference plan generalizes a mapping, in order to allow for mass from a point 
X to be split into multiple parts. A transference plan is a probability measure, tt, 
on the product space X x Y, whose marginals are y and v. This means that 

7r[A xY\= y{A), Tr(X x B) = v{B) 

for all measurable subsets A of X and B of Y. The set of transference plans is 
written 


(4) 


n(/r, ix) = {n G V{X x Y) \ marginals of tt are y, v} 


Kantorovich’s formulation of the optimal transportation problem take the form of 
an infinite dimensional linear program 

(KLP) Minimize /[tt] = / c{x,y)d'K{x,y), for tt G n(/r, 

JxxY 

Since the constraints are linear, and the cost function is also linear, this is an infinite 
dimensional linear program. Under broad conditions, this problem has a minimal 
value, which is called the optimal transportation cost between y and v. However, 
in general, there may be more than one optimal transference plan. 
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Remark 1.1 (When an optimal plan is a map). For many costs, the transference 
plan solution from (KLP) is given by a map. This is the case for 

c{x,y) = h{\x-y\), 


where h is strictly convex and grows at super linear speed. In particular, c{x, y) = 
\x — y\^ for p > 1, which includes the important case p = 2. 

More generally, the twist condition, which requires that g{y) = Va;c(x, y) must be 
an injective function of y, ensures that the optimal plan is a map, see [GM96l[Car03| . 
or [VilOSl Chapter 10]. For other cost functions, an optimal transference plan need 
not be a map. In particular, this is the case for the Monge cost c{x,y) = ja: — y\. 


Example I.l (Monge-Ampere Partial Differential Equation). When c{x,y) = ja;— 
yp, the densities are smooth, and Y is convex, the Monge problem becomes a fully 
nonlinear Monge-Ampere PDE with unusual boundary conditions |Eva991 Section 
4] or [VilOSl Chapter 4]. The optimal map T = Vu is given by the gradient of a 
convex function, so that 0 becomes 


det(Zl^u(a:)) 


fix) 

y(Vu(a;)) 


along with the conditions Vu(A) = Y and the constraint that u(x) be convex. 


Example 1.2 (Assignment). A special case of the Monge problem is the assignment 
problem from combinatorial optimization |Sch03l Ch 17 and Ch 21], which is to 
find a minimum cost bipartite matching between two sets. See Figure a). If X 
and Y are discrete with n points each having the same mass, then 


M = 








where 5x represents the Dirac mass at the point x. In this case, the Monge problem 
(|^, becomes the assignment problem 

n 

Minimize '^^c{xi,yT(i)), over all permutations T of {I,...,n} 

i=l 


Example 1.3 (The discrete transportation problem). A special case of (KLP) is 
the transportation Linear Program, see Figure [^b). Civen discrete probability 


measures 


n m n m 

i—1 j — ^ i —1 


The cost function is given by the non-negative n x m matrix, c = (c^). A discrete 
transference plan, tt, is a non-negative n x m matrix whose marginals are y and v. 
The set of transference plans 0 becomes 





m 

n 'j 

(5) 

n(y,i/) = < 

q 

II 

^ ^ TTi j = Pi , 

i=i 

i=i J 


The transportation linear program is given by 


n m 

Minimize /[tt] = EE 

3^1 


(LP) 


for TT G n(y, v) 
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Figure 1 . (a) Illustration of a mapping from three points to three 
points, (b) An optimal transference plan can split points. 


The problem (LP) is easily written in the standard form for a linear program. 


Remark 1.2. Notice that the number of variables used to represent the densities in 
(LP) is n + m, and the number of variables for the plan tt is nm. So the size of the 


linear programming problem grows quadratically in the number of variables. 

Example 1.4 (Transportation recovers Assignment). Consider the discrete trans¬ 
portation problem (LP) where the measures are given by the equal weight measures 


from Example 1.2 In this case, (after removing the factor of 1/n) the set of plans, 
n, becomes the set of bistochastic matrices. By Choquet’s theorem, the solution 


of (LP) are extremal points of bistochastic matrices. By Birkhoff’s theorem, these 


are given by permutation matrices. Thus the optimal transference plans in Kan¬ 
torovich’s problem coincide with the permutation maps in Monge’s problem, and 
the solutions are the same. 


1.4. Related work. Rigorous approaches to solving the Optimal Transportation 
rely on different theoretical notions of the solution. Many solution methods are 
specific to the case of quadratic cost (distance squared), or the Monge cost (dis¬ 
tance). 


Partial Differential Equations. First consider the important special case of qua¬ 
dratic cost, c{x,y) = \x — j/p. The early Benamou-Brenier formulation leads to a 
fluid mechanics solver, by adding a synthetic time variable to the problem [BB00| . 
adding one dimemsion to the problem. The Monge-Ampere Partial Differential 
formulation was recently used to solve the OT problem using a convergent finite 
difference method [BF014| . This method places regularity requirements on the 
densities, one of which must have convex support. Our method has comparable 


accuracy to the PDF approach, see (4.3 


Earth Mover Distance. For the linear cost case, c{x,y) = \x — y\^ fast methods for 
computing approximate solutions are available [AIKn8| . These methods work by 
computing an approximate embedding of the 1-Wasserstein distance into Euclidean 
space with the metric. However the embedding induces a distortion of the metric 
which limits the accuracy of the method |NS07| . 
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Entropic regularization. Entropic regularization methods are a recently introduced 
method which modify the the optimization problem by adding a small multiple of an 
entropy term [Cutl3l ICPR15| to the objective (or cost) function. The regularized 
problem can be solved by a Bregman iterative solution method BCC^IS] , However, 
the number of iterations required for convergence increases as the regularization 
parameter goes to zero. The recent article SdGS+15] applied the method to large 
scale geometric problems. We compare the performance of our method to the 
entropic regularization method in §4.1[ Our method also avoids the blurring of 


barycenters introduced by the entropic regularization, see (4.6 


Linear Programming and Combinatorial Optimization. The linear programming 
problem implemented directly is simply too costly to be effective for solving the OT 
problem. For example, in the early paper [R.UOOj . problem sizes of up to = 15 
were solved. Current implementations still become very costly after N = 7200, 
see Table This is because the size of the problem grows quadratically in the 
number of variables, see Remark 1.2 An alternative approach is to instead solve 


the assignment problem [BC99| , approximating fractional weights for the measures 
by multiple assignents, see for example see |BC89] . However combinatorial opti¬ 
mization algorithms for the assigment problem do not seem to perform better than 
linear programming. In best cases, the complexity is worse than quadratic |Sch03j . 


Multiscale solvers. Multiscale solvers have previously been used to improve solver 
performance, but without achieving linear efficiency. Merigot solved a sequence 
of OT problems where the target is a sum of Diracs, improving performance by 
an order of magnitude |Merllj . Benamou and coauthors combines the approach 
of [BCC^l^ with a grid refinement procedure to solve larger scale problems in 
[BCN15| . In |C0014j a one step grid refinement was used to find the support of the 
barcyenter. Schmitzer |Schl5irSS13| used a grid refinement procedure which was be 
applied to both LP and combinatorial optmization solvers, improving performance 
by one order of magnitude. 


2. Discretization and convergence 


In this section we perform the discretization which reduces (KLP) to a finite 
dimensional Linear Program (LP). We prove that the solutions of the discrete 
problem converge weakly to the solution of (KLP I as the grid resolution parameter 
—>■ 0. Even when the limiting solution is a map, the weak solution of (LP) can 
be a plan. By using barycentric projection of the transference plan, we recover the 
map, resulting in improved accuracy. 


2.1. Discretization: finite dimensional linear programming. For claritiy 
and simplicity, we impose a Cartesian grid with uniform spacing h on the domains 
X and Y. The initial grids, X^, and are given by a uniform Cartesian grid with 
spacing h intersected with the support set, 

x^ = hzn A, y’^ = hzn y. 

We enumerate the grid points, (which could also be referred to as quadrature 
points), 

X^ = {xi,...,Xri}, Y^ = {yi,...,ym.}. 
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Each Xi is the center of a hypercube of width /i, 






Define the approximate measures to be a weighted sums of Dirac masses 

whose weights corresponds to the integral of the measures over the hypercubes of 
width h centred at 


( 6 ) 


m '* = XI = KRh{Xi)) 

m 

= X = v{Rh{yi)) 

2=1 


The discrete cost function is given by 


(7) dj = c{xi,yj), 

With these definitions, the infinite dimensional optimal transportation problem 
(KLPI is reduced to the finite dimensional linear programming problem (LP). We 
record it in the defintion which follows. 


Definition 2. Define 


at grid resolution h to mean the costs and measures 


are given by using § and Q. Write for an optimal solution 


of (LP) at resolution h. From we recover the approximation to the optimal 


transference plan tt by the imbedding 


( 8 ) 


“ X X 

i=l j=l 


Remark 2.1. A more accurate method to discretize the measures would be use a 
centroidal Voronoi tessellation of the support of the measures, and let Xi,yj be 
the weighted barycenters of the Voronoi regions |DFG99| , weights according to the 
measures /r and v, respectively. In this case, we would refer to the points as quad¬ 
rature points. This discretization of measures has been performed in |Merllj . This 
corresponds to a quantization problem, which is studied in the literature |GL0n] . 


2.2. Convergence of the linear progamming approximation. In this subsec¬ 
tion we prove that convergence of the solutions of (LP) to the solution of (KLP) 
as /i —?► 0. We use a fundamental stability result for solutions of the optimal trans¬ 
portation problem, along with the consistency of the approximation to obtain a 
weak convergence result. 

Given and c{x,y) as described in ^1.3 Gonsider a sequence hk —t 0. The 
following theorem is a special case of |Vil08[ Theorem 5.20]. 


Theorems (Stability of optimal transportation). Assume X andY are compact, 
and that c(x, y) is continuous. Let and be a sequence of measures that eon- 
verge weakly to p, and v, respectively. For each k let he an optimal transference 
plan between and v^’‘. Then, up to extraction of a subsequence, 

77^'“ convergences weakly to tt 


where tt is an optimal transference plan between p and v. 
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Theorem 4 (Convergence of Linear Programming Solutions). Suppose that X 
and Y are compact, and that c{x,y) is continuous. Let be a solution of (LP) 
given by the discretization @ 0 (H- Then, up to extraction of a subsequence, tt" 
convergences weakly to an optimal transference plan solution of (KLP|. 


Proof. The measures given by 0 are approximations by weighted Diracs masses 
over intervals whose volumes go to zero as h —>■ 0. So clearly these measures 
converge weakly to the limits p, and iz. 

The discrete cost function given by 0 can be extended to a continuous cost 
function defined on X x Y which converges uniformly to c{x,y). This can be 
accomplished, for example, by defining 


c{xi,yj) = Cij, i = l,...,n,j = l,...,m 


extend it continously to X x Y, by multilinear interpolation. 

The solution of (LP) results in an optimal transference plan tt^ given by 0. 
Apply the stability theorem above to obtain the desired weak convergence result. 

□ 


2.3. Recovering the map from the plan using barycentric projection. Even 
when the limiting transference plan tt is a map, the approximating discrete marginal 
TT^ will not be map, in general. However an approximation to the optimal map can 
be recovered from the transference plan using barycentric projection. 

The transference plan obtained from the linear programming solution is given 
by the imbedding 0 tt^ = ^ Thus the mass pi at Xi is transported to 

multiple points pj with weights given by piXi i—)■ '^ik^yk- 


Definition 5 (Barycentric projection of transference plan). Define for each i with 
Pi > 0, tji to be the Euclidean barycenter of the points [yi,... ,ym) with weights 

(tT^i, . . . 

E m 

_ k=l 'Xikyk 
Vi V—V m 

Z^fe=l 'Xik 

Then define the barycentric projection of the transference plan by 

n 

See Figure]^ 

See |AGS06[ Definition 5.4.2] for the continuous case. Then is a (discrete) 
map which pushes forward p^ to (an approximation) of . 


Remark 2.2. Generally, yt no longer belongs to the set {yi,... ,ym}- Correspond¬ 
ingly, while the first marginal of is equal to p^, the second marginal of is 
generally different from . However, it is still a weak approximation of v, which 
converges weakly to v as h ^ Q. 

Barycentric projection is discussed in [AGS06| . In particular, by Theorem 5.4.4 
and by Lemma 12.2.3, we can conclude that, for convex costs, the barycentric 
projection of the approximations converges to the barycentric projection of the 
limiting transference plan. In particular, when the unique limit is a map tt, then 
Tf^ converges to tt. 
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^ 1^2 ... yj ... 



Figure 2. Illustration of the transference plan tt = and the 
barycentric projection tt. The coloured band represents the sup¬ 
port of the transference plan. Points in the same row are bundled 
into one as indicated by the small circle. The curve in the middle 
represents the resulting map. 


3. Sparsity and a grid refinement procedure 


The full linear program (LP) which approximates (KLP) is too costly to solve 


for large problems. In the case where the solution from (KLPI is given by a map, 


we can expect that the discrete solution of (LP) will be sparse, provided h is small 
enough. By estimating the support of the transference plan, tt^, the number of 


variables in (LP) is reduced from quadratic to linear dependence on the inputs. 


Estimating the support in this manner forms the basis of the grid refinement 
procedure. This process is iterated, leading to a multiple step grid refinement 
procedure. 

By considering a sequence of linear programs parameterized by h we can find 
the support of the solution, tt^ , from the support of the solution corresponding to a 


smaller sized problem. Then we can solve the reduced size linear program, (LPR) 


below, with the expectation that the solution of the sparse problem is the same as 
that of the full problem. This will be the case if the previously estimated support 
is exact. We make a heurstic argument for why these supports should be the same, 
appealing to the stability property of linear programs under perturbation. But 
this argument is not rigorous in the limit /i —> 0, since the sensitivity of the linear 
program can depend on h. However, by growing the support we can test after 
each step if the support exceeds the estimated support. In practice, this has only 
happened for a the first stage for very small initial grid sizes. However, if this were 
to occur, the algorithm could be modified as needed. 
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□ Allowed support 

o Extrapolated support of previous solution 
° Support of current solution _ 



□□□□□□□□□□□□ 

□□□□□□□□□□□□ 

□□□□□□□□□□ 

30 40 

nz = 79 


60 


Figure 3. Illustration of the multiscale solution procedure. 


An alternative is to use the method of Schmitzer [Schl5| (see also |SS13| ), which 
ensures global optimality of the solution to the sparse subproblem by using locally 
adapted neighborhood sizes. 

We mention another apsect of the refinement technique which could be improved. 
In the interested of keeping the already complex code simpler, we refined using 
bisection in each square. An improvement in accuracy, and perhaps also the sensi¬ 
tivity of the refined problem would be to refine adaptively, either according to the 
densities, or according to the senstitivity of the linear program. In the latter case, 
we could also expect an inprovement in the accuracy. 


3.1. The sparse linear program. Let tt = Tr^ be the solution of (LP), and let 
Kq be the basis set, the indices of the nonzero entries of tt. If Kq C K, for a known 
set, K, then we can recover tt by solving the reduced linear program, 


(LPR) 


Minimize E ij 1 

{iKbileR"} 

Subject to: ^ 




TTy > 0, 


i G [l,...,n] 
j G [1,..., to] 
(b j) e K 


Remark 3.1. Note that, in contrast to (LP), the number of variables in ( |LPR | is 
|Ar|, and the number of constraints is |Ar| -|- n -|- to. In practice, |Ar| is a small 
multiple of n -|- TO. So the size of the reduced linear programming problem (LPR) 
grows linearly in the number of variables used to represent the measures. 


3.2. Multiscale solution procedure. The multiscale solution procedure is de¬ 
scribed here. Refer to Figure which illustrates the process for a one dimensional 
example. 
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(0) The full linear program ( |LP[ ), with and ^ — fj,^, v = as given 

by the quadrature rules <§ and 0. is solved on a coarse grid, x Y^. 

(1) From the discrete solution, tt^-, recover the spatial support set, 

Sh = {RhixD X Rh{y!}) I 4 > 0} , ShCX^x Y\ 

Grow the spatial support set, by including allowing neighbours (in space), 
Sh = { neighbours Sh}. 


Extract the corresponding indices 

Kh = { indices of center points in Sh}- 


(2) Refine the grids in each domain by a factor of two, labelling the new grids 

y^/2. In each coordinate of a; = (xi,..., Xd), the interval [xi — h, Xi + 
h] is halved, to become [xi — h, x^], [x^, x^ + /i], and the two new points are 
generated at the midpoints of the interval. Thus, the hypercube Rh{x) is 
divided evenly into 2“^ hypercubes, and the grid point x generates 2^ new 
points on the refined grid, each at the centres of the smaller hypercubes. 

(3) The allowed spatial support is interpolated onto the finer grid Sh /2 = A 
and the corresponding indices are extracted 


Khi 2 = { child indices of Kh}- 


(4) The sparsely supported linear programming problem (LPR) is solved with 

indices Kh /2 and with and given by the quadrature rules ^ 

and 0. 

(5) Repeat starting at Q, until a fine enough solution is computed. 

Remark 3.2. Depending on the coarseness of the approximations, Step 0 can be 
skipped, or repeated, so that neighbours of neighbours are included. The compu¬ 
tational cost of adding additional support indices is not significant. 

Step 0 is geometrically simple; algorithmically, an indexing formula is used to 
determine the child indices. 

In step 0 (|LPR) has \Kh/ 2 \ variables, which is on the order of 2‘^(n -I- m). 


3.3. Justification of the grid refinement procedure. In the case where the 
continuous plan tt is a map, from the convergence result of the linear programming 
approximation, we know that as h —>■ 0, the support of the discrete marginal tt^ is 
a weak approximation of the support of tt. Consequently, for Xi,j/j away from the 
support of TT, T^x-^y^ = 0 is zero, for h small enough. 

In practice, we found that the support of the refined solution was always con¬ 
tained in Kh/ 2 . Should the property ever fail, the code checks if the support tt^ 
reaches the boundary of the index set, Kh. 

In this section, we give an indication of when the estimation of the support given 
by Kh /2 is exact for the refined solution. 

Consider the standard form linear program. 

Minimize x 


(9) 


Subject to: 


( Ax = b 
\ X >0. 


Then we have the following stability result |BT971 Chapter 5]. 
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Theorem 6 (Stability of linear programming). If the standard form linear pro¬ 
gram © has a unique optimal solution x*, then the support of the optimal solution 
x*{c) is unchanged for nearby values c. If it has has a nondegenerate optimal basic 
solution X* , then the support of the optimal solution x*(b) is unchanged for nearby 
values b. 

In this context, a basic solution means the support is equal to the number of 
rows of A, which is assumed to be full row rank. This corresponds to n + m — 1 


nonzeros for (LP). 


Consider the following linear programming problems. 


Definition 7. Define (LP) or (LPR) at grid resolution h to mean the costs and 
measures are given by cC, using (|^ and ([^. Write tt^, for an optimal 


solution of (LP) at resolution h, h/2, respectively. Write for the solution of 


(LPR) at resolution h/2 using the indices K = Kfi /2 given by support estimation 
procedure, as in Step ([^. 

In the following lemma we show that there is a linear program with data close to 
that for (LP) at scale h/2, which has the correct support Kh /2 obtained from 


Lemma 8. Consider the linear programming solution There is a perturbation 

of the cost function c^/'^, for which the support of the solution, of the perturbed 

problem is contained in the indices Kf ,/2 obtained using Step (§ from the coarse 
solution TT^. The perturbed cost can be made arbitrarily close to by taking 
h^O. 


Proof. Dehne an auxiliary linear program from (LP) at scale h as follows. On the 


grid of scale h, each grid point Xi or yj is split into 2“ children. Denote an arbitrary 
child of Xi by Xi, and similarly for pj. Define the “doubled” linear program (LP), 
on the grid at scale h/2, as follows. The measures are still given by and 
The cost function is given by 


=c^ 

Xi,yj ^Xi,yp 


for any child grid points Xi, yj of Xi, yj 


Observe from the definition ([^, and from the definition of the child points Xi of Xi, 
that 


( 10 ) 


h h/2 


Vi 


E 


h 2 

. 

Vi 


Now having defined with the data above, let be a solution. Note that 
there are multiple solutions, since pairs of child vertices have equal costs. Recall 
that tC from Dehnitionis the solution of (LP) at scale h. Observe that is a 
splitting of TT^, in the sense that 


,V3 


= E 


..h/2 


,V3 


where the sum is over all child grid points Xi of Xi and pj of yj. This follows from 
the fact that the cost c is constant over pairs of child grid points, and also, since 


the mass of the marginals is split amongst the children (10). 

So the solution of the linear program, corresponds to a redistribution of 

the marginal tt^ to to child vertices based on the refinement of the measures. This 
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means that the support of is contained in K-^j^ from Step (§. (It could be 
smaller, which could arise, for example, if some [ixi — 0.) 

Next observe that (|LP[ ) at resolution h/2 and the “doubled” linear program are 
in the standard form ([^ with the same values for A and 6, but different values of c. 

Recalling the definition of the costs, Q, at any point Xi,yj, they differ by at 
most c{xi,yj) — c{xi + x^yj + y) where |a;|oo, |2/|oo < h/2. Since by assumption the 
cost function is continuous, this difference can be made arbitrarily small. □ 


By the stability theorem for linear programming, if is the unique optimal 
solution, of (LP) at scale h/2, and if is close enough to then the support 
of is the same as the support of which is contained in K}^/ 2 - So the 
sparse linear programming solution has the same support as and, since 


the data is the same for and 


h/'i _ 


'if 


To summarize, in the case where we can apply the stability theorem, as above, we 
can conclude that the solution of the sparse linear program ( LPR[ ) with support 
Kh /2 is equal to the solution of the full problem (LP) at resolution h/2. 


4. Numerical results 


4.1. Performance. We used MATLAB to generate the full and sparse LP prob¬ 
lems and to call the LP solver. We tried several LP solvers, and found that several 
commercial products (MOSEK, Gurobi, CLPEX) performed equally well. We called 
these products from CVX, and also called them directly from MATLAB. 

For benchmarking performance we used the example of transporting uniform 
densities from a square to diamond, see Figure The LPs were generated in 
MATLAB using Gurobi as a solver. Table compares computation time and mem¬ 
ory usage of the multigrid LP solver the full LP solver and the entropy method, 
implemented using sample code provided by the authors of [BGG+lb] . Computa¬ 
tion time and memory usage grow linearly with the number of unknowns, 2N‘^, 
for the Linear Program, as shown in Figure]^ We used a PC laptop (i3 1.9GHz 
CPU with 12GB RAM). For comparison, we also ran the example with N = 512 
on a 2015 MacBook Air (2.2GHz intel i7 CPU with 8GB RAM) and the run time 
decreased by about 30%. 

For the full solver, the memory usage with = 64 or 8192 variables was 7 GB 
and the computation time was nearly 527 seconds. In comparison, the multigrid 
method was an improvement in both time and memory of two orders of magnitude. 
The largest sized problem we computed using the multiscale solver corresponded 
to N = 512, or about half a million variables. Compared to the largest problem 
for the full solver, this is an increase of problem size two orders of magnitude. The 
performance of the entropy method was faster than the full solver, by about one 
order of magnitude, and allowed for a problem size of about double the maximum 
problem size for the full LP solver. Compared to the entropy method at the largest 
problem size, N = 96, or 18432 variables, the multiscale solver was about 50 times 
faster, and used about 40 times less memory. A comparable run time corresponded 
to a problem size of A^ = 512 or 524288 variables. The method in SdCS~*~l^ 
introduced performance improvements over the implementation of [BCC~*~1^ as well 
as hardware improvement. They used a computer with 23.5GB RAM, implemented 
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Figure 4. CPU time and memory usage of the multiscale linear 
programming solver. Both memory usage and computation time 
scale linearly with problem size. 


on a GPU, (which is significantly faster than a CPU), the run times reported here, 
which were limited to smaller problem sizes, were comparable to ours. 


N (grid) 


MGLP 


ELP 


ER 

2N^ 

CPU 

Memory 

CPU 

Memory 

CPU 

Memory 

32 

2048 

0.9 

14 

14.0 

400 

3.8 

900 

48 

4608 

1.6 

40 

95.6 

2000 

20.1 

1200 

64 

8192 

3.2 

80 

527.1 

7000 

64.4 

1900 

96 

18432 

7.0 

160 

* 

* 

310.1 

6200 

128 

32768 

13.5 

300 



* 

* 

192 

73728 

35.3 

700 





256 

131072 

58.9 

1100 





384 

294912 

165.5 

2500 





512 

524288 

287.6 

4000 






Table 1. Comparison of run time and memory usage for the 
multigrid LP (MGLP), the full LP (FLP), and the entropic reg¬ 
ularization (ER) method, the latter with precision to 10“"^ and 
regularization parameter e = 10“^ l |BCG~*~15| b Precision for the 
LP solvers are both 10“®. Memory usage for the LP solvers are as 
reported by Gurobi, and is estimated for the ER method. 


4.2. Smoothing the density. The target density obtained by the push forward of 
the barycentric projection of tt^ is still given as a weighted sum of Dirac masses. We 
can improve the accuracy of the result by smoothing the density, in other words by 
simply replacing each Dirac mass by a sum of Gaussians with standard deviation on 
the same scale as h. We took values between 5 and 9 multiples of h. More general 
methods for Kernel Density Estimation in the context of statistical inference are 
discussed in [Sil86) . 

Eigure is a comparison of optimal transport map before and after applying 
Gaussian filter using the map between uniform densities on a square and a diamond, 
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Figure 5. Top: Optimal transport map from square to diamond 
recovered by barycentric projection. Left: Before smoothing. 

Right: After smoothing. Bottom: Barycenter of a square and 
2 Cylinders. Left: Before smoothing. Right: After smoothing. 

discussed further in §4.4| The second part of the hgure shows smoothing of a 
computed barycenter, see §4.6[ 


4.3. Accuracy of the solutions. In this section we report on the accuracy of 
solutions in terms of the grid spacing h, by considering exact solutions to the 
Monge-Ampere PDF. These examples were used in |BF014| . The first example is 
for a variable density. The second example the well-known example of Caffarelli 
which splits a circle. In the case the mapping consists of two translations. In these 
examples, both maximum and L 2 error decrease approximately linearly with grid 
size which is comparable with the results in |BF014| . The solutions are plotted in 
Figure]^ 


Example 4.1. Transportation between two squares of the same size. The target has 
constant density, while the source square (Figure]^ has a density given by 

f{x,y) = 1 + 4: {q"{x)q{y) + q{x)q"{y)) + 16 {q{x)q{y)q" {x)q" {y) - (^{xfq'{yf) , 


where 

q{z) 



1 1 \ 

2567r3 3271 ) 


cos(87rz) 


32t2^ 


sin(87r2:); 
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Figure 6. Optimal transport map recovered by barycentric pro¬ 


jection. Left: Target of Exam ple |4.1| 
|4.2| Right: Target of Example |4.^ 


Middle: Source of Example 


The optimal transportation map is given by 

f = x + 4,q'{x)q{y) 

1 Uy {x,y) = y + iq{x)q'{y) 

The accuracy of the solution is presented in Table 

Example 4.2. Eigurej^ shows a circle is split into two half circles. The potential is 
given by 

V{xi,X2) = -i\xi\ +xl + xl); 

The accuracy of the solution is presented in Table 


N (gridsize) 


Max error Ex 
L 2 error Ex 
Max error Ex 
L 2 error Ex 


4.1 


4.1 






32 

0.00721 

0.00385 

0.0625 

0.01211 


64 

0.00892 

0.00379 

0.03125 

0.00302 


128 

0.00689 

0.00257 

0.01563 

0.00148 


256 

0.00241 

0.00103 

0.00781 

0.00050 


512 

0.00148 

0.00047 

0.00391 

0.00018 


Table 2. Accuracy of the solution for Example |4 . 1 1 and |4 . 2 


4.4. Numerical solutions of Optimal Transportation Problem. Visualiz¬ 
ing OT maps requires care. Our strategy for visualizing maps begins, following 
[BE014] , by taking the source domain to be a square with uniform density, and by 
pushing forward the image of the grid lines onto the target domain. Subsequently, 
we consider more complicated source domains, still mapping horizontal and vertical 
lines from the target. We also label points in the source as well as their images for 
clarification. We suggest spending time inspecting the images. 

We begin by computing examples of for the Optimal Transportation problem 
with quadratic cost c(a:, y) = \x — j/p. With restrictions on the measures explained 
in Example o the solution is given by solving the Monge-Ampere Partial Dif¬ 
ferential Equation. These maps were computed in [BE014j . and we refer to this 
paper for visualizations of maps to convex targets. The first examples are uniform 
densities. Example |4.3| shows the optimal map from a square to a non-convex tar¬ 
get. Example |4.4| shows the optimal map from a square to a diamond. Example |4.5| 
shows a map between two non-convex domains. 
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In Example 4.6 , we consider costs of the form c{x, y) = \x 
piecewise constant density on the source rectangle. 


y\’P, and consider a 


Example 4.3. First consider the OT problem with y, the uniform density on a square 
and V to the uniform density on the “Pac-Man”, a circle with a section of 7r/2 
radians removed. The mapping is presented in Figure The map is discontinuous 
on a line segment market OG on the square, which is mapped to two lines in 
the target, from the center O to halfway down the radial segments. The map is 
continuous on the rest of the domain. Near the points E, F in the source, vertical 
lines of the form x = c are mapped to corners, lines of the form x = \y\+ c. 

Example 4.4. Figure shows the optimal map from y the uniform density on a 
square, X, to y, the uniform density on a diamond. The source square is not 
plotted, but it can be seen in Figure The map shares the symmetries of the 
target. But while the central horizontal and vertical lines are mapped to straight 
lines, lines on the boundary of the square bend to the corners. Squares near the 
center of X are mapped to squares, while squares need the edges of X are mapped 
to rectangular shapes with large aspect ratios. 

Example 4.5. Refer to Figure The source domain, X is given by a square with 
a medium square removed from the center, and with smaller squares removed from 
the four corners. The target domain, T, is the set < 1} with an 

inner diamond shape removed, given by {\x\ + \y\ < 9/25}. 

We mapped the vertical and horizontal lines in the source domain to the target 
domain. The problem was solved on a 512^ grid, but plotted on a 256^ grid for 
clarity. We indicated the images of eight points, the inner and outer diagonal 
corners of X. Both shapes are symmetric about diagonal lines, and this symmetry 
is preserved by the map. The inner points in the target domain show discontinuities 
in the inverse map; points from far away in the source are mapped to points near, 
for example. A'. The dent near A! indicates the discontinuity: as more lines are 
used in the plot the size of the dent decreases. 


Example 4.6. In this example we vary the cost function. Refer to Figure which 
is plotted for a grid of size 256^. The source density y is piecewise constant on the 
rectangle X = i? = [—1,1] x [—1/2,1/2]. It is given, up to normalization, by 

The regions are illustrated in the figure by color. Note that the density is lowest 
on the blue middle strip = and highest in the red section. 

The target density v is uniform on 

The grid lines on X are mapped to T, along with the colors of the lines. Since 
the density is not uniform on X, squares coming from higher density areas will be 
mapped to larger regions, compared to squares from lower density areas. 

Consider the second figure, which corresponds to p = 2. In this case the low 
density middle strip of the rectangle, i?m, is mapped to a slightly convex vertical 
middle strip in the target. The higher density edges go to corresponding vertical 
cap shapes in the target. A square in the center of R is mapped to a rectangle in 
Y of aspect ratio about 2. 




Figure 7. Top: Optimal map from the square to “Pac-Man”. 
Bottom: Optimal map between two non-convex shapes. 


As p increases, a square in the center of R is mapped to a shape in Y with 
increasing height and decreasing width. The image of the boundary vertical lines 
|a;| = I in Rm go from convex (p = 2) to nearly linear (p = 3) to concave (p = 6). 

Consider the first image, which corresponds to p = 1.05. In this case, Rm is 
mapped to an elongated diamond shape in the middle of Y. The midpoint of 
the top and bottom edges are mapped into the interior of the target, and their 
horizontal neighbors are folded in on each other. The two vertical dents the top 
and bottom of the target illustrate this folding, which means the inverse map from 
Y to i? is discontinous. (Similar results were obtained when the source was changed 
to a rounded rectangle.) 
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Figure 8. Top: source density is piecewise constant on a rec¬ 
tangle, with different values indicated by color. Optimal map from 
source to uniform density on the target, with cost c{x, y) = \x — y\^^ 
for p = 1.05, 2,3,6, from left to right and top to bottom. 


4.5. Partial Optimal Transportation. The partial optimal transportation prob¬ 
lem, studied by Caffarelli and McCann [CMIO] and Figalli |FiglO| , extends the OT 
problem with quadratic costs to the case where there is an excess of mass, and only 
a given fraction is to be transported. The result is a free boundary problem to 
determine which mass should be transported to minimize the total transportation 
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cost. Uniqueness of the mapping was established in [CMIO] when X and Y are dis¬ 
joint, and then by Figalli for when XC\Y when the total mass to be transported 
is in excess of the measure on the intersection. Figalli proved local regularity of 
the free boundaries away from the intersection when X and Y are strictly convex. 


The linear programming discretization (LPI is easily modified to include par¬ 


tial mass transportation, by replacing the marginal projection equalities with an 
inequality, along with a single equality constraint to reflect the total mass to be 
transported. 


Minimize /[tt] = EE' 




Z = 1 j = l 


Subject to: < 


i=i 

n 

i=l 

J2ij 

TTij ^ 0 . 


i G {l,...,n}, 
j G {!,..., to}. 


We consider simple geometries for X and Y, using uniform measures on the do¬ 
mains. We verified that free boundaries are never mapped to free boundaries, which 
is consistent with Figalli’s prediction. 

Each plot shows multiple free boundaries, corresponding to increasing the total 
mass to be transported. See Figure Note that the apparent kinks in the free 
boundary lines are plotting artifacts, since we plotted the support squares of the 
free boundaries without smoothing. It would also be reasonable to smooth the free 
boundaries using piecewise linear interpolation. 


Example 4.7. Here X and Y are upward and downward facing parabolas, separated 
by a gap. The plot shows several the partial optimal transportation free boundaries 
as the mass to be transported is increased. The free boundary is both flat, and 
smooth. 


The remaining examples consider X and Y which intersect. In these cases, the 
total mass to be transported is expressed as a multiple of the total mass of the 
overlap. 

Example 4.8. Here X and Y are intersecting squares. The problem was solved on 
a 512^ grid. For small excess mass ratios, the partial transportation free boundary 
is small set near the corners of overlap of the squares. Only when the excess mass 
ratio is increased to 1.11 does the free boundary cover the full intersection of the 
squares. 

Example 4.9. Now X and Y are upward and downward facing parabolas which 
overlap. We see very different results compared to the case with squares. In this 
case, even for very small excess mass ratios, the partial transportation free boundary 
contains the intersection set X HY. We plotted two figures. In the first, the mass 
ratio goes from 1.01 to 1.11, with steps of 0.02. For a mass ratio of 1.01, the excess 
mass is not large compared to the resolution of the grid, but it is clear that as the 
mass is increased to 1.11, the full intersection is transported. In the second figure, 
we steps of 0.04 in the mass ratio, going from 1.05 to 1.25. At the mass ratio of 
1.05, the free boundary is several grid points away from the intersection set. 
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Notice the difference between the two examples. At small excess mas ratios, for 
the squares, the free boundary is clearly a subset of the intersection, while for the 
parabolas, it covers the entire intersection. An obvious difference between the two 
examples is that the furthest point of the parabolas are not nearly as far away as 
the furthest points of the squares. Another way to describe the difference is in 
terms of the tangents to the free boundaries. For the parabolas, the tangents to the 
free boundaries change very little, from a small positive slope to a small negative 
one. For the squares, the tangents to the free boundaries go from nearly horizontal, 
to nearly vertical. 

4.6. Barycenter. Given the points x = (xi,...,Xn) and the positive weights 
which sum to unity, the Euclidean barycenter fi = fJ-iXi is the 

minimizer of 

n 

I{x) = - Xi\‘^. 

i=l 

The Wasserstein distance allows us to define the barycenter of multiple measures. 

By analogy, given the measures pi,..., and the positive weights pi,..., /j,„, 
which sum to unity, the barycenter is the minimizer over probability measures of 

n 

J{P) = 

i=l 

where W 2 is the value of the optimal transportation problem with quadratic costs. 
Variants include using other costs, c{x,y) = |a; — y\P. Refer to Agueh and Car- 
lier |AC11| . The barycenter problem can be discretized as a linear program, 
see |G()()14] . 

Example 4.10. The barycenter of three shapes is shown in FigureThis example 
is taken from [BCC^l^ Figure 7], where it was computed on a lower resolution 
grid. We also computed the barycenters of two rectangles, as in |C0014j and found 
very similar results (not plotted). 

Example 4.11. Figure [T0| shows the barycenter of three sections of an annulus. The 
support of the barycenter get smaller as p increases. The barycenter inherits some 
of the symmetries of the measures. 

Example 4.12. The next example shows a range of barycenters of uniform measures 
supported on four shapes, as in |SdGS~*~l,^ . Figure [TT] shows the solution computed 
on a 1024^ grid, without smoothing. The solution is accurate, and the boundary 
of the shapes are sharp, compared to the blurring of boundaries introduced by 
entropic regularization. 


5. Conclusions 

We introduced an efficient Linear Programming (LP) method for solving Optimal 
Transportation problems on general domains for a wide range of cost functions. The 
method is linear in complexity (in terms of both time and memory), and allowed 
for problem sizes of up to half a million variables to be solved in a few minutes on 
a laptop. Available parallel LP solvers allow even larger problems to be solved. 

The method was accurate enough to compare solution features for optimal trans¬ 
portation maps between non-convex shapes for different costs functions. For qua¬ 
dratic costs, the results are comparable to the best available method, but without 
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0.02 mass ratio 
0.04 mass ratio 
0.06 mass ratio 
0.08 mass ratio 
0.10 mass ratio 
0.12 mass ratio 



- 

1 1 1 1 1 1 1 

- 1.01 mass ratio 

- 1.03 mass ratio 

1.05 mass ratio 

-1.07 mass ratio 

1.09 mass ratio 

1.11 mass ratio 



_ 

- 




-0.8 -0.6 -0.4 -0.2 


0.2 0.4 0.6 0.8 



- 1.01 mass ratio 

1.03 mass ratio 

0.8 


- 1.05 mass ratio 

1.09 mass ratio 

- 1.05 mass ratio 

1.07 mass ratio 

- 1.09 mass ratio 

1.11 mass ratio 

0.6 


- 1.13 mass ratio 

1.17 mass ratio 

- 1.21 mass ratio 

1.25 mass ratio 



-0.8 -0.6 -0.4 -0.2 


0.2 0.4 0.6 0.8 


Figure 9. Partial optimal transportation free boundaries corre¬ 
sponding to different levels of transported mass. Top left: measures 
supported on non-overlapping on parabolas. Top right: Overlap¬ 
ping squares. Bottom: overlapping paraboloids. 


the restriction of convexity of the target domain. For other costs, our method is 
significantly faster and more accurate than available methods. We proved weak 
convergence of the approximation, and implemented the barycentric projection of 
the measure to improve the accuracy of the mapping. 

The method was also applied to generalizations of the Optimal Transportation 
problem, including barycenters and partial Optimal Transportation. Other gener¬ 
alizations are possible, as long as these probems can be expressed as Linear Pro¬ 
grams. 
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Figure 10. Barycenter of three sections of an annulus. From left 
to right and top to bottom: cost c{x, y) = \x — y\^,p= 1.05, 2, 5, 9. 
Grid size 512^. 
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Figure 11. Barycenters of the four shapes in the corners, com¬ 
puted on a grid of size 1024^. 
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